Business understanding:

The Citywide Payroll data was collected by the City of New York because of public interest in how the city is using its budget for salary and overtime pay for all municipal employees. The goal in publicizing this data was that it can then be used to analyze how the city's budget is being allocated and how much of the budget is being allocated to overtime. The data could be used to analyze how pay or amount of overtime varies by position or agency, and it may also be possible to make classification predictions such as job title or workplace borough based on factors such as agency, pay, or experience.

The first value we looked at predicting was base salary. There are several values that may affect base salary such as job title, work location, work agency, or start date. Similarly, overt time hours and overtime paid could also be predicted from the dataset using similar predictors. In any of these cases the goal would be to find the algorithm that minimizes the RMSE. The RMSE tells us how much error we have plus or minus from the actual value, so the smaller the RMSE, the closer the prediction is to the real value.

If we were predicting something like job title or workplace location based on pay and other factors, I would look to increase not only the classification accuracy, but the sensitivity and specificity of the algorithm as well. An algorithm can be accurate but could have a very high false positive or negative rate and it is important to make sure to minimize false predictions. To retrieve the accuracy values, we would validate the algorithm using 10-fold cross validation.

Data Meaning Type

The data consist of the following 17 attributes in 3,923,290 rows:

Year – The fiscal year data is being reported for. Spans 2014 – 2020. Numeric value.

Payroll Number – Unique numeric identifier for the payroll agency. 157 unique numbers. Numeric value.

Agency Name – The Payroll agency that the employee works for. 165 unique agencies. Text.

Last Name – Employee’s last name. Text.

First Name – Employee’s first name. Text.

Mid Init – Employee’s middle initial. Text.

Agency Start Date – Date which employee began working for their current agency. Date and time field as MM/DD/YYYY.

Work Location Borough – Borough of employee’s primary work location. 22 unique values. Text.

Title Description - Civil service title of employee. 1,802 unique job titles. Text.

Leave Status as of June 30 – NYC Fiscal Year runs July 1st to June 30th. Leave status of employee as of closing of relevant fiscal year. Valid values are Active, Ceased, On Leave, On Separation Leave, or Seasonal. Text.

Base Salary – Employee’s Base Salary measured in individual dollars and cents. Numeric value.

Pay Basis – How employee is paid. Valid values are per Annum, per Hour, Prorated Annual, or per Day. Text.

Regular Hours – Number of regular hours employee worked in the fiscal year. Numeric value.

Regular Gross Paid – The amount paid to the employee for base salary during the fiscal year. Measured in individual dollars and cents. Numeric value.

OT Hours – Overtime hours worked by employee in the fiscal year. Numeric value.

Total OT Paid – Total overtime pay paid to employee in the fiscal year. Measured in individual dollars and cents. Numeric Value.

Total Other Pay – includes any compensation in addition to gross salary overtime. Includes bonuses, meal allowances, uniform allowance, etc. Measured in individual dollars and cents. Numeric value.

It is important to note that Base Salary fluctuates with Pay Basis. For example, hourly employees have their hourly rate listed while per annum employees have their yearly salary listed. To normalize this column, we multipled the base rate for the hourly employees by their regular hours worked to calculate an annual salary. For per day employees, we divided the Regular Hours by 8 to caclulate an hourly rate and then multipled the new hourly rate by the number of regular hours worked to calculate an annual salary. An alternate option would have been to calculate the hourly rates of employees paid by yearly or daily salary, however because each row represents an entire fiscal year it made more sense to us to calculate a yearly salary for each employee.

Data Quality

The overall data quality is excellent with only a little more than 5% of the total dataset missing and only 17 duplicate rows. As the duplicates represent such a small fraction of the total dataset, it is probably safe to assume that these represent data entry errors and they can be dropped without affecting overall data quality. It is important to note, however, that the data dictionary for this dataset states that all entries were manually entered and there may be data entry errors that are not obvious without knowing more about the data.

Regarding the missing values, about 41% of the middle initial category is missing. This is not concerning as it has no relevance to making predictions and middle initial, first, and last name can all be dropped safely without affecting data quality.

Payroll number is missing about 45% of its data. Of the 165 unique agency names, there are 162 that have errors in their payroll numbers. The Police Department and all 6 District Attorney’s offices do not have payroll numbers. Similarly, the department of Juvenile Justice is also missing a payroll number for 97.73% of its entries. However, this is to be expected. The data dictionary for this dataset specifically states that information for some agencies, including the NYPD and DA offices, have been redacted to protect the safety of their employees. Since there are no null values for agency name and payroll number does not affect the quality of our predictions, we can drop payroll number without affecting data quality and treat the agency name as a unique identifier instead.

Work location borough is missing about 13% of its data or 506,226 rows. This is not confined to any one specific agency or job title as all job titles and agencies are missing data for work location. However, there are trends in the data that can help to explain some of the missing values. For example, per session teachers make up 14.74% of the entries without a primary work location (74,607/506,226). We also see large numbers for police officers, election workers, and both Fire Department and Parks and Recreation staff. It is quite possible that many of these workers rotate from location to location throughout the city and do not have a dedicated primary work location that they are assigned to. Although the numbers may appear large, the per session teachers without a work location only make up about 12.56% of the total number of per session teachers in the entire dataset (74,607/594,024). It may be possible to remove these rows as outliers, a unique subset of the data, or as data entry errors. This column also includes counties and other locations where New York City employees work and the data could be subsetted to only include the boroughs of New York City or the outlying counties, depending on where we wanted to place focus or emphasis.

The final attributes, agency start date and title description are missing 63 and 84 records respectively. These missing records may be caused data entry error and could be treated as outliers without affecting the overall data quality since they represent such a small portion of the data.

There are also some interesting values for the numeric columns that need to be looked at further. The salary columns, Base Salary and Gross Paid, show a long right tail which is normal for salary data and does not necessarily indicate outliers. There are negative values for Regular Gross Paid, Total OT Paid, and Total Other Pay. These may also not be outliers and may represent cases where employees left the city and had to pay back bonuses, uniform allowances, etc. or may represent adjustments where an employee left or there was an error in the pay or time calculations.

Unusually, there are also negative values for Regular Hours and Overtime hours. These entries might have been necessary to calculate where an employee left and was overpaid, but could also represent data entry errors. There are 1,186 Regular Hours with negative values and 3,688 OT hours with negative values. Because these entries represent less than 1% of the total data, it is probably safe to treat them as an error and to drop these rows.

Simple Statistics

For this section we are using our data set with no missing values. We will do some additional data cleanup here by addressing data types and negative values. Then we discuss range, mean, median, variance, and counts.

Negative Values

Notice that there are negative min values for most of our columns. We will look closer at these values to see if they can possibly be removed. Too many negative values will affect our ability to forecast salary unless we discover a valid reason for having negative values. They might also make it difficult to see the true variability in the data.

We omit the Fiscal Year column. Simple statistics for this attribute are not useful. Treating it as a continuous attribute is useful for sorting and performing comparison operations.

A count of the negative values reveals that there are not enough to effect anything if we remove them. Our dataset contains more than 3 million records. We will only remove around 29000.

Simple Statistics

Mean, Median, and Range

After removing the negative values, it is interesting to see that the minimum values for Regular Gross Paid, Base Salary, and Regular Hours are zero. It appears that there are people listed as employees but earn no salary. Zero Regular Hours makes sense because those that are paid annually would not have their hours tracked. The range of these values suggests high variability.

We know that salary data tends to be right skewed. That being the case, it makes more sense to focus on median salaries rather than the mean. Outliers tend to pull the mean to the right. The median is more accurate for this data set.

Variance

We have high variance across all of the continuous attributes. We know that the four classes of Pay Basis are included in Base Salary: per Annum, per Day, per Hour, and Prorated Annual. The variance for the remaining attributes is explained by the wide variety of job titles, the amount of compensation, and differences in how they are compensated. For example, a substitute teacher is paid per day. They don't work everyday. Their Regular Gross Paid attribute is included with that of a Commissioner who has a much higher salary.

Mode

The fact that Manhattan is the Borough that appears most often in this data set adds to the legitimacy of this data. One would expect Manhattan to have the highest number of employees in New York state. It also makes sense that since most employees are paid annually (per Annum), a zero value for OT Hours would be most common.

Counts

The high variation in base salary will make it difficult for our model to perdict. As we look closer at base salary, we noticed that it includes all classes of Pay Basis. That is why we are displaying it separately to show the statistics for the different classes of Pay Basis. Again, the minimum values of zero are interesting. Let's see how many employees are earning zero dollars.

Employees Earning Zero Dollars

There are very few emplyees with a salary or gross pay of 0 compared to the entire data set. The fact that we have a large number of employees with 0 hours worked is interesting. We will take a closer look at that. We assume that these are salaried employees.

Count of Employees with Zero Hours Worked by Pay Basis

This is the count of employees with zero hours broken down by Pay Basis. Employees who are paid per day, and annually add up to 1543763, which is 80% employees with zero hours. This make since, they are not paid by the hour, thus hours worked would not be tracked. The employees who get paid by the hour but show zero hours worked might throw off our model when we try to predict. The model might put all these in the same category but taking Pay Basis into consideration should remedy this situation.

Visualize Attributes

Top 10 Median Salaries by Agency

The Base Salary column contains four classes of salary. We are using 'Regular Gross Paid' where we can get the same insights if we were to convert the Base Salary column to annual figures. What's important to point out is that 'Regular Gross Paid' should be a strong predictor for a classification problem when we get into the higher salaries.

Top 10 Median Salaries by Job Title

There are 1763 unique job titles. Classifying them using a model will be more difficult than classifying the Agency. But our model should still perfrom well given the data.The higher the salary, the easier it will be to predict job title. When we get to the mid and lower salaries, Pay Basis will help with classifying.

This small table supports our Job Titles bar chart. We are showing the number of unique job titles.

Change in Base Salary Over Time

There are zero observations where Pay Basis is Prorated Annual in 2014. We notice the similar results for 'per Day' and 'per Hour' (very few observations). Over time, the trend is as expected for salary data, upward. It is likely that there was some policy change that allowed for prorated salaries, effective, fiscal year 2015. The implication here is that it might make more sense to train our model on fiscal years 2015 and up.

Per Hour:

The number of observations for the 2014 fiscal year for the Per Hour pay basis is 168. Additionally, 133 of those observed show to have a pay rate of 1 dollar an hour. Therefore, 2014 is excluded from this visual for the pay basis of 'per Hour'. There is significant increase from 2015 to 2020.

Per Day:

The number of observations for the 2014 fiscal year for the Per Day pay basis, and where base salary is above 50 dollars per day are both 15. Therefore, 2014 is excluded from this visual for the pay basis of 'per Day'. Additionnally, there is no change in median pay per day from 2015 to 2020. A quick look at the job titles associated with per day pay basis reveals that most of these positions are temorary. For example, there are 46311 substitute teachers. The amount that these workers receive is likely voted on by citizens rather than negotiated. We don't have data to back this up, but typically salaries increase over time for various reasons. They are either negotiated, or organizations keep their compensation practices in line with the industry in order to attract and retain the best talent

We are displaying this table to support our statements about the line charts above (The one with per Day and Per Hour). This shows the number of employees at 1 dollar per hour in 2014 and the number of employees above 50 dollars per hour in 2014. This backs up our resoning for not including fiscal year 2014.

The purpose of this table is to support our statement about the number of the different job titles that are paid per day.

Employees with the Highest Hourly Pay

This is not a mistake.

There are employees who earn between 400 and 500 per hour. There are very few and they all work in the Civil Service Commision. The implcations are that a classification model should easily identify the Agency Name but will have trouble with accurately identifying the title description. For this entire data set, we should look at median salaries. Any interpretations or conclusions based on the mean would be suspect.

We included this table to support the histogram with highest hourly pay above. This shows that the employees with the highest hourly pay are all from the same agency.

This must be a mistake.

What is interesting here is the extremely high hourly rate for the employees making more than 500 dollars per hour in the histogram. But notice in the table that all these employees have zero hours worked. It could be that the Pay Basis for this subset should be in the 'per Annum' class. We should move this subset to that class and perform some calculations to convert the hourly rate to an annual rate and place that into a new column named AnnualSalary when we perform our predictions. Or we can use the Regular Gross Paid column which will yield the same benefit for a classification problem.

This table supports our statement that the high salaries (>20000 per hour), with zero hours worked must be a mistake.

Explore Joint Attributes

We ran a scatter matrix on a random sample of 30000 observations from the data set to reduce processing time.

OT Hours

Over Time Hours is strongly corelated with Total Overtime Paid, with a positive correlation. But this is not the case between OT Hours and Regular Gross Paid. Regular Gross Paid is gross salary, including overtime. But not all employees work overtime, such as those who are paid annually.

Base Salary

Base salary and Regular Gross Paid certainly have a strong positive correlation. Here we have not converted Base Salary to all annual figures. The variation that is shown is due to Base Salary including the 'per Hour' and 'per Day' classes. If this Base Salary were converted to all annual figures, the correlation with RGP would be much stronger.

Regular Hours

Regular Hours and Regular Gross Paid have a positive correlation. This correlation would be stronger if we were only dealing with employees who have their hours tracked. Here again we see that there are many employees whose hours worked is not tracked. This makes Regular Hours a weak predictor for a regression problem. It will be strong for a classification problem. Separating job titles by those with hours worked and those with none is a good starting point.

We use the correlation heatmap to expand on Regular Hours. It is not as noticable in the scatterplot, but regular hours and Total OT Paid have a significant correlation. An employee with a lot of overtime will certainly have maxed out (or come close to maxing out) in Regular Hours. Regular hours would be a better attribute to use for a predictive model. There are more employees with regular hours than those who worked overtime. And all of the employees who worked overtime will have regular hours.

Explore Joint Attributes and Class

Lets Analyze which factors can be used for making potential predictions for income rates for municipal employees in NYC.

Theres a relationship between job title and base salary for employees who are getting an hourly rate. The civil service commission has employees who make an above average hourly rate compared to others in this same category of pay type (Hourly Rate).

When and where were the highest salaries in the New York area from 2014 to 2020 ?

The table below illustrates the difference and change in pay overtime for different job titles by year and county. This gives us a better idea of what the trend in pay is and that there seems to be a negative pattern overtime. (See "Years with the Largest and Smallest Gross Pay" below for illustration)

Years with the Largest and Smallest Gross Pay

We see here a negative trend for overall pay from 2014 to 2020. It would be interesting to also assess this similar trend after 2020 using a regression model to see how this trend has changed or remained the same post pandemic.

We wanted to analyze which jobs were the highest and lowest paying jobs. This metric will help us get a good metric on the min and max gross pay rates and which jobs are associated to those rates when we eventually create our prediction model for job title.

New Features

The Job category column was added to the dataset to help in the regression modelling process later on as we looked predict job title based on other variables in the dataset.

Other new features to be included would a Annual Salary column that takes the annual rate of pay for all pay types ("per Day, "per Hour", "per Annum", "per Annum prorated"). This will help us predict the overall salaries of employees when we get to model creation.

Exceptional Work

Is location a factor in who makes high or lower annual wages?

First lets group the data by location or in this case work location borough and by fiscal year so we accurately assess trends overtime. Its best to assess location statistics geographically to get the best understanding of how location comes into play when looking at the data. To that end we'll join the dataset to a shapefile of the counties of New York state.

The map below shows an interesting trend in the NY area. This indicates that the higher gross incomes are in the suburb and surrounding counties of NYC (The NYC area consisting on the NEW YORK, RICHMOND, QUEENS, KINGS, and BRONX counties). Compared to the overall population in the NYC area, the distribution of wealth seems to be all over the board compared to what seem to be the more affluent areas such as Westchester and Dutchess counties which consistently have the higher gross income levels overtime